Introduction to Generative Modeling: Beyond Discriminative Analysis

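We now move from discriminative modeling, which solves classification and regression tasks by learning the conditional probability $P(y|x)$, to the more demanding field of generative modeling. The core objective becomes density estimation: learning the full underlying distribution $P(x)$ of the data itself. This shift lets a model capture the intricate dependencies and structure of high-dimensional datasets, so that it is no longer limited to drawing decision boundaries and can instead model the data and synthesize new examples.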

1. The Generative Objective: Modeling $P(x)$

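The goal of a generative model is to estimate the probability distribution $P(x)$ from which the training data $X$ was drawn. A successful generative model can perform three key tasks: (1) density estimation (assigning a probability score to an input $x$), (2) sampling (generating entirely new data points $x_{new} \sim P(x)$), and (3) unsupervised feature learning (discovering meaningful, disentangled representations in a latent space).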
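As a minimal sketch of the first two tasks, the example below fits a single multivariate Gaussian to toy data, scores points under it, and draws new samples. The Gaussian model, the toy data, and every name here (`log_density`, `true_mean`, and so on) are assumptions made for illustration; practical generative models are far more expressive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training data": 1,000 points from a 2-D Gaussian (an assumption for illustration).
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=1000)

# Fit: maximum-likelihood estimates of the Gaussian parameters.
mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False, bias=True)


def log_density(x, mu, sigma):
    """Log of the fitted Gaussian density at each row of x."""
    x = np.atleast_2d(x)
    d = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance, using solve() rather than an explicit inverse.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


# (1) Density estimation: score a typical point and a far-away point.
typical_score = log_density(mu, mu, sigma)[0]
outlier_score = log_density(np.array([10.0, 10.0]), mu, sigma)[0]

# (2) Sampling: draw new points from the fitted distribution.
x_new = rng.multivariate_normal(mu, sigma, size=5)
```

The point near the fitted mean receives a higher log-density than the distant one, and `x_new` holds five freshly sampled points. Task (3), feature learning, needs a latent-variable model and is not covered by this sketch.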

2. Taxonomy: Explicit vs. Implicit Likelihood

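Generative models are fundamentally categorized by how they treat the likelihood function. Explicit density models, such as variational autoencoders (VAEs) and flow-based models, define a mathematical likelihood function and try to maximize it (or a lower bound on it). Implicit density models, of which generative adversarial networks (GANs) are the best-known example, skip likelihood computation entirely and instead learn a mapping function, trained within an adversarial framework, that produces samples approximating draws from $P(x)$.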
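For reference, the standard training objectives behind these two families can be written as follows. The notation is an assumption introduced here and not taken from the text above: $p_\theta$ is the model density, $q_\phi(z|x)$ an approximate posterior, $p(z)$ a latent prior, $G$ a generator and $D$ a discriminator.

$$
\begin{aligned}
\text{Explicit (exact likelihood):}\quad & \theta^{*} = \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}(x_i) \\
\text{Explicit (lower bound, VAE):}\quad & \log p_{\theta}(x) \;\ge\; \mathbb{E}_{q_{\phi}(z|x)}\big[\log p_{\theta}(x|z)\big] - \mathrm{KL}\big(q_{\phi}(z|x)\,\|\,p(z)\big) \\
\text{Implicit (GAN):}\quad & \min_{G}\max_{D}\; \mathbb{E}_{x \sim P(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
\end{aligned}
$$

Only the first two expressions involve $\log p_{\theta}(x)$; the GAN objective never evaluates a density, which is why it is called implicit.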

Data Synthesis and Feature Interpolation

Generative models demonstrate their capability by generating novel, high-fidelity instances (e.g., unseen faces, complex textures) or by allowing semantic interpolation in the learned latent space, illustrating the model's grasp of data variability.

Examples of AI-generated faces and interpolated features.
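Semantic interpolation is often implemented as a straight line between two latent codes, with each intermediate code decoded back into data space. The sketch below shows only the mechanics: `decode` is a hypothetical stand-in (a fixed random map), not a trained model, and the dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained decoder: a fixed nonlinear map
# from an 8-D latent space to a 64-D data space.
latent_dim, data_dim = 8, 64
W = rng.normal(size=(latent_dim, data_dim))


def decode(z):
    """Toy decoder (assumption); a real model would be a trained network."""
    return np.tanh(z @ W)


# Two latent codes, standing in for the encodings of two different inputs.
z_a = rng.normal(size=latent_dim)
z_b = rng.normal(size=latent_dim)

# Linear interpolation: z(t) = (1 - t) * z_a + t * z_b for t in [0, 1].
ts = np.linspace(0.0, 1.0, num=7)
z_path = (1.0 - ts)[:, None] * z_a + ts[:, None] * z_b

# Decode every intermediate code into data space.
x_path = decode(z_path)
```

With a trained decoder, the rows of `x_path` would morph gradually from one example to the other; here they only confirm that the endpoints decode to `decode(z_a)` and `decode(z_b)`.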

Question 1

In generative modeling, what is the primary distribution of interest?

$P(x)$

$P(y|x)$

$P(x|y)$

$P(y)$

Question 2

Which type of generative model relies on adversarial training and avoids defining an explicit likelihood function?

Variational Autoencoder (VAE)

Autoregressive Model

Generative Adversarial Network (GAN)

Gaussian Mixture Model (GMM)

Challenge: Anomaly Detection

Leveraging Density Estimation

A financial institution has trained an explicit density generative model $G$ on millions of legitimate transaction records. A new transaction $x_{new}$ arrives.

Goal: Determine if $x_{new}$ is an anomaly (fraud).

Step 1

Based on the density estimate of $P(x)$, what statistical measure must be evaluated for $x_{new}$ to flag it as anomalous?

Solution:
The model must evaluate the probability density (likelihood) $P(x_{new})$, in practice usually its logarithm. If $P(x_{new})$ falls below a predefined threshold $\tau$, meaning the new point is statistically improbable under the learned distribution of legitimate transactions, it is flagged as an anomaly.
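The threshold rule can be sketched in a few lines. Everything concrete below is an assumption for illustration: a diagonal Gaussian stands in for the trained model $G$, the two features are synthetic, and $\tau$ is set at the 1st percentile of the training log-densities, which is one possible policy and not one prescribed by the problem.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for legitimate transactions: two standardized features (hypothetical).
X_train = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))

# Stand-in for the trained model G: a diagonal Gaussian fitted to the training data.
mu = X_train.mean(axis=0)
var = X_train.var(axis=0)


def log_p(x):
    """Log-density of each row of x under the diagonal Gaussian stand-in for G."""
    x = np.atleast_2d(x)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)


# Set log(tau) at the 1st percentile of training log-densities, so that roughly
# 1% of legitimate transactions would be flagged (an assumed policy).
log_tau = np.percentile(log_p(X_train), 1.0)


def is_anomaly(x_new):
    """Flag x_new when log P(x_new) < log(tau)."""
    return log_p(x_new) < log_tau


typical = np.array([0.1, -0.2])
extreme = np.array([6.0, -7.0])
flags = is_anomaly(np.vstack([typical, extreme]))
```

Working in log space avoids numerical underflow when densities are very small. The choice of $\tau$ trades false alarms against missed fraud and would normally be tuned on held-out data.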